Before starting with ggplot, we clear the workspace, load the libraries and prepare the data:
rm(list = ls(all.names = TRUE))
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diamonds <- data.frame(ggplot2::diamonds)
diamonds <- sample_n(diamonds, 10000, replace = FALSE) # sample 10000 of the 50000+ observations with sample_n() from the dplyr library
Start with a simple plot of diamonds. The syntax of ggplot follows a few rules. Aesthetics describe what you see: axes, color, fill, linetype, shape, size and so on; all variables (fields) are passed through them. Geometric objects are the functions whose names start with geom_: bars, lines, points, box plots and so on. Objects are added to a plot with the + operator.
To plot on the x and y axes you don’t need to "quote" the variable names. You also don’t need to name x and y explicitly: the first argument passed to aes is x, the second is y.
This plot just creates the axes; note that we assign the plot to a variable, p0.
p0 <- ggplot(diamonds, aes(x=carat, y=price))
p0
Draw points at the coordinates given by carat and price. We assign the plot with points to a new variable, p1, but we could also take the previous plot and add something to it. The plot drawn here is p0 + geom_point(); calling p1 gives the same plot.
p1 <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
p0 + geom_point()
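To make the layering idea concrete, here is a minimal sketch that counts the layers of a plot object before and after adding a geom. It assumes the plot object exposes its layers as `$layers`, which holds for the ggplot2 versions I am aware of; the names `base_plot` and `with_points` are only illustrative.

```r
library(ggplot2)

# Axes only: data and aesthetics are set, but no geom has been added yet
base_plot <- ggplot(ggplot2::diamonds, aes(x = carat, y = price))
length(base_plot$layers)

# The + operator returns a new plot with one extra layer; base_plot is unchanged
with_points <- base_plot + geom_point()
length(with_points$layers)
```

Because `+` returns a new object, the same base plot can be reused with different geoms.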
In ggplot you can color what you are plotting by another variable (preferably a factor); in this case we color by the cut of the diamonds.
p1 <- ggplot(diamonds, aes(x=carat, y=price, col=cut)) + geom_point()
p1
You can also combine color and point size: we keep the color and add a new mapping in which the size of each point is given by the variable price.
p2 <- ggplot(diamonds, aes(x=carat, y=price, col=cut)) + geom_point(aes(size=price))
p2
Save the plot. You can save it as pdf, tiff, png and so on. The format must appear twice: as the device function you call (here png) and as the file extension. You can set the width and the height and, depending on the format, the unit of size (cm or pixels), the resolution in dpi and so on.
png('myplot.png',width = 1920,height =1080,res = 100)
p1
dev.off()
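As an alternative to opening and closing a device by hand, ggplot2 provides `ggsave()`, which picks the device from the file extension. The sketch below is only an illustration: the output path in the temporary directory and the name `p_save` are assumptions, and the size is chosen so that 19.2 x 10.8 inches at 100 dpi gives roughly the same 1920 x 1080 pixels as above.

```r
library(ggplot2)

p_save <- ggplot(ggplot2::diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point()

# Hypothetical output path; the .png extension selects the png device
out_file <- file.path(tempdir(), "myplot.png")
ggsave(out_file, plot = p_save, width = 19.2, height = 10.8, units = "in", dpi = 100)

# Confirm that the file was written
file.exists(out_file)
```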
From now on I will not assign the plots to variables.
If we would rather see how the color grade of the diamond, instead of its cut, affects the price, we can change the aesthetic. Here in “aes” we change “cut” to “color”.
ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()
Now, what if we want to see the effect of several variables at once? We can map additional aesthetics, such as the size and the shape of the points. Here color represents the clarity; let’s add “size=cut” and “shape=color” inside geom_point.
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point(aes(size=cut,shape = color))
## Warning: Using shapes for an ordinal variable is not advised
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 505 rows containing missing values (geom_point).
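Now the size of every point is determined by the cut, the shape by the diamond color, and the color of the point is still determined by the clarity. The warnings appear because the variable color has 7 levels while the default shape palette provides only 6, so the points of the seventh level are not drawn.
You can change the theme of the plot; for example, theme_classic() draws the plot without the grey background.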
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point(aes(size=cut,shape = color)) +theme_classic()
## Warning: Using shapes for an ordinal variable is not advised
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 505 rows containing missing values (geom_point).
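The shape warning suggests specifying shapes manually. A minimal sketch of that, using `scale_shape_manual()`: the seven shape codes below are an arbitrary choice of mine, and the data are subsampled (every 50th row) only to keep the example light.

```r
library(ggplot2)

# Every 50th row of the full dataset, to keep the plot light
d <- ggplot2::diamonds[seq(1, nrow(ggplot2::diamonds), by = 50), ]

# One shape code per level of color (D to J); the particular codes are arbitrary
shape_codes <- c(16, 17, 15, 3, 7, 8, 4)

ggplot(d, aes(x = carat, y = price, shape = color)) +
  geom_point() +
  scale_shape_manual(values = shape_codes) +
  theme_classic()
```

With one shape per level, no points are dropped for lack of a shape, although seven shapes remain hard to tell apart.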
Instead of points we can draw lines that connect the observations in order of x, one line per clarity level.
ggplot(diamonds, aes(x=x, y=price, color=clarity)) + geom_line()
You can fit a regression on the fly on the dataset. For example, we want to see if the length (x) and the width (y) are correlated (we expect them to be highly correlated).
ggplot(diamonds, aes(x=x, y=y)) + geom_point() + geom_smooth(se=TRUE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
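Note that if you don’t pass any argument to geom_smooth, the confidence band around the curve is drawn by default and ggplot chooses the smoothing method automatically: loess for fewer than 1000 observations, gam otherwise (as the message above shows). You can also choose the method yourself, for example ‘lm’, ‘glm’ or ‘loess’; see the help.
# In this case we use glm as the method and we decide not to plot the standard errors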
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth(se=FALSE, method="glm")
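As a sketch of choosing the method explicitly, the example below asks for a straight-line fit with `method = "lm"` and then fits the same model with `lm()` to read off the coefficients. The subsampling (every 20th row) and the names `d`, `p_lm` and `fit` are my own choices for illustration.

```r
library(ggplot2)

# Subsample to keep the example fast; any data frame with carat and price works
d <- ggplot2::diamonds[seq(1, nrow(ggplot2::diamonds), by = 20), ]

# Straight-line fit with an explicit method and formula, plus the confidence band
p_lm <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
p_lm

# The same model fitted directly, to read off intercept and slope
fit <- lm(price ~ carat, data = d)
coef(fit)
```

Passing `formula` explicitly also silences the message about the formula that geom_smooth prints otherwise.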
If you used a color aesthetic, ggplot will create one smoothing curve for each color. For example, if we add “color=clarity”:
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
geom_point() + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Now the output tells you which method was used and which formula was applied to the model. Use method = ... to change the smoothing method.
Faceting allows you to split a plot into many panels based on a particular variable. We add facet_grid with a tilde (~) (Alt+126 on Windows) followed by the attribute (variable) we would like to divide the plots by, here “cut”. Note that we could also add facet_grid to the plot p1 saved before.
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
geom_point() + theme_classic() + facet_grid(~ cut)
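You can also facet your plot by more than one factor. For example, we can color the points by diamond color and then facet by cut and clarity using the following syntax. The formula in facet_grid reads left ~ right: the variable on the left of the tilde defines the rows and the variable on the right defines the columns (in model formulas the tilde means “is explained by”).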
ggplot(diamonds, aes(x=price, y=carat, color=color)) + geom_point(aes(size=table))+
facet_grid(cut ~ clarity)+ theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))
Now let’s zoom in on this. We divided the plot into 40 subplots (5 cuts by 8 clarity grades), which is too many to be useful for analysis. In each panel we have price and carat on the two axes, points colored by the color of the diamond and sized by table, and the panels are split by cut and clarity. Moreover, to keep the x labels from overlapping we added the ggplot theme function, which rotates the labels by a given angle and adjusts them vertically and horizontally (see section below).
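The number of panels produced by facet_grid is the number of row levels times the number of column levels. A quick check on the full diamonds data, followed by a lighter alternative with `facet_wrap()` on a single variable (the `alpha` and `ncol` values are arbitrary choices):

```r
library(ggplot2)

# facet_grid(rows ~ columns) draws one panel per combination of levels
n_rows <- nlevels(ggplot2::diamonds$cut)      # 5 cut grades
n_cols <- nlevels(ggplot2::diamonds$clarity)  # 8 clarity grades
n_rows * n_cols                               # 40 panels

# Faceting by one variable only gives a more readable layout
ggplot(ggplot2::diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ clarity, ncol = 4)
```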
Use another dataset
mpg <- (mtcars)
mpg$carb <- as.factor(mpg$carb)
mpg$cyl <- as.factor(mpg$cyl)
ggplot(mpg, aes(disp, hp, color=carb)) + geom_point() +
facet_grid(cyl~gear)
Note that for convenience we converted carb and cyl to factors. If you do not convert a numerical variable to a factor, instead of one color for each value present in the data frame you get a color gradient, as with gear in the next plot.
mpg <- (mtcars)
mpg$carb <- as.factor(mpg$carb)
mpg$cyl <- as.factor(mpg$cyl)
ggplot(mpg, aes(disp, hp, color=gear)) + geom_point() +
facet_grid(cyl~carb)
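If you prefer not to modify the data frame, the conversion can also be done inside aes. A small sketch, assuming the built-in mtcars data (the copy `cars` is just an illustrative name):

```r
library(ggplot2)

cars <- mtcars

# gear is stored as a number, which is why color = gear gives a gradient
class(cars$gear)

# Converting on the fly gives one discrete color per value of gear
levels(factor(cars$gear))
ggplot(cars, aes(disp, hp, color = factor(gear))) +
  geom_point() +
  labs(color = "gear")
```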
Add a title and a subtitle to a plot
ggplot(diamonds, aes(x=x, y=z)) + geom_point() + ggtitle("My scatter plot",subtitle = 'subtitle of scatterplot')
Label the axes
ggplot(diamonds, aes(x=carat, y=table)) + geom_point() + ggtitle("My scatter plot") + ylab('Width of top of diamond relative to widest point(43-95)')
You can also set limit on the axis with the xlim and ylim
ggplot(diamonds, aes(x=x, y=y)) + geom_point() +
ggtitle("My scatter plot") + xlab("length (mm)") + ylab("width (mm)") +
xlim(3, 6) + ylim(4,8)
## Warning: Removed 4346 rows containing missing values (geom_point).
Note that we received a warning: because of the axis limits, some of the observations cannot be plotted.
We can do simple transformations of numerical data while drawing the plot, without creating a new field or storing the result in a new variable.
In this case we want the log of price.
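xlim and ylim discard the observations outside the limits. If you only want to zoom in without dropping data, `coord_cartesian()` is an alternative. The sketch below counts the rows that fall outside the window in the full diamonds data (so the number differs from the sampled data used above) and then zooms with coord_cartesian.

```r
library(ggplot2)

d <- ggplot2::diamonds

# Rows outside the window; xlim()/ylim() would turn these into missing values
n_outside <- sum(d$x < 3 | d$x > 6 | d$y < 4 | d$y > 8)
n_outside

# coord_cartesian() only zooms the view: all rows are kept
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(3, 6), ylim = c(4, 8))
```

The difference matters when a statistic such as geom_smooth is added: with xlim/ylim it is computed on the remaining data only, with coord_cartesian on all of it.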
ggplot(diamonds, aes(x=carat, y=log(price))) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + ylab('log (price)')
The above y=log(price) is basically the same as creating a new variable: diamonds$logprice <- log(diamonds$price)
However, the suggestion is not to use log directly but to apply the ggplot scale function, which puts the axis on a log scale while keeping the original values on the axis labels.
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + ylab('price (log10 scale)') + scale_y_log10()
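Note that log() is the natural logarithm while scale_y_log10() uses base 10. The two differ only by a constant factor, so the shape of the point cloud is the same; a small check on a few example prices (the values are picked only for illustration):

```r
# A few example prices
price <- c(326, 1000, 10000, 18823)

# Base-10 logs: 1000 and 10000 map to 3 and 4
log10(price)

# Natural log and base-10 log differ only by the constant log(10)
all.equal(log(price), log(10) * log10(price))
```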
You can plot the histogram of just one variable at a time, e.g. x or y
ggplot(diamonds, aes(x=price, group=cut)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And change the width of each bin
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=2000)
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=200)
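To see what the bin width does, the counts can be reproduced approximately in base R with `cut()` and `table()`. This is only a sketch: the break points starting at 0 are my assumption, and ggplot may align the bin edges differently, so individual counts can differ slightly from the histogram.

```r
library(ggplot2)

price <- ggplot2::diamonds$price

# Bins of width 2000 between 0 and 20000
counts_2000 <- table(cut(price, breaks = seq(0, 20000, by = 2000)))
length(counts_2000)   # 10 bins
counts_2000

# Bins of width 200 over the same range: ten times as many bins
counts_200 <- table(cut(price, breaks = seq(0, 20000, by = 200)))
length(counts_200)    # 100 bins
```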
You can also increase the number of bins
ggplot(diamonds, aes(x=price, color=cut)) + geom_histogram(bins = 1000)
It is also possible to fill the bars by another variable, here the cut, so that each bar shows how the count splits across the cuts
ggplot(diamonds, aes(x=price)) + geom_histogram(aes(fill=cut))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also apply faceting to plots with histograms
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=10, aes(fill=clarity)) + facet_wrap(~ cut)
Schema of boxplot
## B. BOXPLOT
##
##  outliers  whisker  median       whisker
##               |        |            |
##               V        V            V
##                  +-----------+
##    o  oo   |-----|     |     |------------|   ooo oo o
##                  +-----------+
##                  /\         /\
##                  ||  (box)  ||
##                  1 q        3 q
##
## The hinges equal the quartiles for odd n (where n <- length(x)) and
## differ for even n.
ggplot(diamonds, aes(x=cut, y=price)) + geom_boxplot()
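The elements of the schema can be computed in base R with `boxplot.stats()`. A small worked example on a made-up vector; note that this function uses hinges, while geom_boxplot uses the 25th and 75th percentiles, so the box edges can differ slightly between the two.

```r
# Made-up data with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats   # 1.0 3.0 5.5 8.0 9.0

# Points beyond 1.5 times the box length from the box are reported as outliers
bs$out     # 50
```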
# Plot the boxplot for each cut type and mark any outliers in red; here the color is passed manually. Note that this plot uses a log scale on the y axis, so the two plots have different scales.
ggplot(diamonds, aes(x=cut, y=price)) + geom_boxplot(outlier.colour = 'red') + scale_y_log10()
The width at each point in this violin plot represents the frequency of that price. So the bumps show the prices that are more common, and we can see that within some colors there is indeed bimodality: there are multiple common price levels that a boxplot does not show.
ggplot(diamonds, aes(x=color, y=price)) + geom_violin() + scale_y_log10()
Bar plots are the usual bar charts, which allow you to compare two variables (a category and a value) instead of one (as in a histogram)
expenses <- data.frame(cost=c(350, 150, 100, 120, 140),
descr=c("rent", "utilities", "fuel", "food", "leisure"))
ggplot(data=expenses, aes(x=descr, y=cost)) +
geom_bar(stat="identity")
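ggplot2 also offers `geom_col()`, which is equivalent to geom_bar(stat="identity"). The sketch below uses it together with `reorder()` to sort the bars by decreasing cost; the sorting is my addition and not part of the example above.

```r
library(ggplot2)

expenses <- data.frame(cost  = c(350, 150, 100, 120, 140),
                       descr = c("rent", "utilities", "fuel", "food", "leisure"))

# Order of the bars after sorting by decreasing cost
levels(reorder(expenses$descr, -expenses$cost))

# geom_col() uses the y values as bar heights, like stat = "identity"
ggplot(expenses, aes(x = reorder(descr, -cost), y = cost)) +
  geom_col() +
  xlab("descr")
```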
You can also fill the bars, for example with one color per category
ggplot(data=expenses, aes(x=descr, y=cost)) +
geom_bar(stat="identity", aes(fill=descr))
And change the width of the bars
ggplot(data=expenses, aes(x=descr, y=cost)) +
geom_bar(stat="identity", width=0.5)
Or you can color the outline of the bars
ggplot(diamonds, aes(x=carat, y=price)) + #<--- see the dataset used
geom_bar(stat="identity", color="blue", fill="white")
Another color to fill the bar
ggplot(data=expenses, aes(x=descr, y=cost)) +
geom_bar(stat="identity", fill="steelblue")+
theme_minimal()
You can add text here (and in any plot you want) with the function geom_text. As for the axes, you can modify the label, the position adjustment, the color and the size.
ggplot(data=expenses, aes(x=descr, y=cost)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=descr), vjust=1.6, color="white", size=3.5)+
theme_minimal()